{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Extending your Metadata using DocumentClassifiers at Index Time\n",
    "\n",
    "With DocumentClassifier it's possible to automatically enrich your documents with categories, sentiments, topics or whatever metadata you like. This metadata could be used for efficient filtering or further processing. Say you have some categories your users typically filter on. If the documents are tagged manually with these categories, you could automate this process by training a model. Or you can leverage the full power and flexibility of zero shot classification. All you need to do is pass your categories to the classifier, no labels required. This tutorial shows how to integrate it in your indexing pipeline."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "DocumentClassifier adds the classification result (label and score) to Document's meta property.\n",
    "Hence, we can use it to classify documents at index time. \\\n",
    "The result can be accessed at query time: for example by applying a filter for \"classification.label\"."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This tutorial will show you how to integrate a classification model into your preprocessing steps and how you can filter for this additional metadata at query time. In the last section we show how to put it all together and create an indexing pipeline."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "## Preparing the Colab Environment\n",
    "\n",
    "- [Enable GPU Runtime](https://docs.haystack.deepset.ai/docs/enabling-gpu-acceleration#enabling-the-gpu-in-colab)\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Installing Haystack\n",
    "\n",
    "To start, let's install the latest release of Haystack with `pip`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "# Install the latest main of Haystack\n",
    "pip install --upgrade pip\n",
    "pip install farm-haystack[colab,ocr,preprocessing,file-conversion,pdf,elasticsearch,inference]\n",
    "\n",
    "apt install libgraphviz-dev\n",
    "pip install pygraphviz"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Enabling Telemetry \n",
    "Knowing you're using this tutorial helps us decide where to invest our efforts to build a better product but you can always opt out by commenting the following line. See [Telemetry](https://docs.haystack.deepset.ai/docs/telemetry) for more details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from haystack.telemetry import tutorial_running\n",
    "\n",
    "tutorial_running(16)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "## Logging\n",
    "\n",
    "We configure how logging messages should be displayed and which log level should be used before importing Haystack.\n",
    "Example log message:\n",
    "INFO - haystack.utils.preprocessing -  Converting data/tutorial1/218_Olenna_Tyrell.txt\n",
    "Default log level in basicConfig is WARNING so the explicit parameter is not necessary but can be changed easily:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import logging\n",
    "\n",
    "logging.basicConfig(format=\"%(levelname)s - %(name)s -  %(message)s\", level=logging.WARNING)\n",
    "logging.getLogger(\"haystack\").setLevel(logging.INFO)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Read and preprocess documents\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from haystack.utils import fetch_archive_from_http\n",
    "\n",
    "\n",
    "# This fetches some sample files to work with\n",
    "doc_dir = \"data/tutorial16\"\n",
    "s3_url = \"https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/preprocessing_tutorial16.zip\"\n",
    "fetch_archive_from_http(url=s3_url, output_dir=doc_dir)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from haystack.nodes import PreProcessor\n",
    "from haystack.utils import convert_files_to_docs\n",
    "\n",
    "# note that you can also use the document classifier before applying the PreProcessor, e.g. before splitting your documents\n",
    "all_docs = convert_files_to_docs(dir_path=doc_dir)\n",
    "preprocessor_sliding_window = PreProcessor(split_overlap=3, split_length=10, split_respect_sentence_boundary=False)\n",
    "docs_sliding_window = preprocessor_sliding_window.process(all_docs)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Apply DocumentClassifier"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can enrich the document metadata at index time using any transformers document classifier model. While traditional classification models are trained to predict one of a few \"hard-coded\" classes and required a dedicated training dataset, zero-shot classification is super flexible and you can easily switch the classes the model should predict on the fly. Just supply them via the labels param.\n",
    "Here we use a zero shot model that is supposed to classify our documents in 'music', 'natural language processing' and 'history'. Feel free to change them for whatever you like to classify. \\\n",
    "These classes can later on be accessed at query time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "from haystack.nodes import TransformersDocumentClassifier\n",
    "\n",
    "\n",
    "doc_classifier = TransformersDocumentClassifier(\n",
    "    model_name_or_path=\"cross-encoder/nli-distilroberta-base\",\n",
    "    task=\"zero-shot-classification\",\n",
    "    labels=[\"music\", \"natural language processing\", \"history\"],\n",
    "    batch_size=16,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "# we can also use any other transformers model besides zero shot classification\n",
    "\n",
    "# doc_classifier_model = 'bhadresh-savani/distilbert-base-uncased-emotion'\n",
    "# doc_classifier = TransformersDocumentClassifier(model_name_or_path=doc_classifier_model, batch_size=16, use_gpu=-1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "# we could also specifiy a different field we want to run the classification on\n",
    "\n",
    "# doc_classifier = TransformersDocumentClassifier(model_name_or_path=\"cross-encoder/nli-distilroberta-base\",\n",
    "#    task=\"zero-shot-classification\",\n",
    "#    labels=[\"music\", \"natural language processing\", \"history\"],\n",
    "#    batch_size=16, use_gpu=-1,\n",
    "#    classification_field=\"description\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "# classify using gpu, batch_size makes sure we do not run out of memory\n",
    "classified_docs = doc_classifier.predict(docs_sliding_window)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# let's see how it looks: there should be a classification result in the meta entry containing labels and scores.\n",
    "print(classified_docs[0].to_dict())"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Indexing"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Start an Elasticsearch server\n",
    "You can start Elasticsearch on your local machine instance using Docker. If Docker is not readily available in your environment (eg., in Colab notebooks), then you can manually download and execute Elasticsearch from source."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Recommended: Start Elasticsearch using Docker via the Haystack utility function\n",
    "from haystack.utils import launch_es\n",
    "\n",
    "launch_es()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Start an Elasticsearch server in Colab\n",
    "\n",
    "If Docker is not readily available in your environment (e.g. in Colab notebooks), then you can manually download and execute Elasticsearch from source."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "shellscript"
    }
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q\n",
    "tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz\n",
    "chown -R daemon:daemon elasticsearch-7.9.2\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "shellscript"
    }
   },
   "outputs": [],
   "source": [
    "%%bash --bg\n",
    "\n",
    "sudo -u daemon -- elasticsearch-7.9.2/bin/elasticsearch"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Connect to Elasticsearch\n",
    "import os\n",
    "import time\n",
    "\n",
    "from haystack.document_stores.elasticsearch import ElasticsearchDocumentStore\n",
    "\n",
    "# Wait 30 seconds only to be sure Elasticsearch is ready before continuing\n",
    "time.sleep(30)\n",
    "\n",
    "# Get the host where Elasticsearch is running, default to localhost\n",
    "host = os.environ.get(\"ELASTICSEARCH_HOST\", \"localhost\")\n",
    "\n",
    "document_store = ElasticsearchDocumentStore(host=host, username=\"\", password=\"\", index=\"document\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Now, let's write the docs to our DB.\n",
    "document_store.delete_all_documents()\n",
    "document_store.write_documents(classified_docs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# check if indexed docs contain classification results\n",
    "test_doc = document_store.get_all_documents()[0]\n",
    "print(\n",
    "    f'document {test_doc.id} with content \\n\\n{test_doc.content}\\n\\nhas label {test_doc.meta[\"classification\"][\"label\"]}'\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Querying the data"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "All we have to do to filter for one of our classes is to set a filter on \"classification.label\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Initialize QA-Pipeline\n",
    "from haystack.pipelines import ExtractiveQAPipeline\n",
    "from haystack.nodes import FARMReader, BM25Retriever\n",
    "\n",
    "\n",
    "retriever = BM25Retriever(document_store=document_store)\n",
    "reader = FARMReader(model_name_or_path=\"deepset/roberta-base-squad2\", use_gpu=True)\n",
    "pipe = ExtractiveQAPipeline(reader, retriever)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "## Voilà! Ask a question while filtering for \"music\"-only documents\n",
    "prediction = pipe.run(\n",
    "    query=\"What is heavy metal?\",\n",
    "    params={\"Retriever\": {\"top_k\": 10, \"filters\": {\"classification.label\": [\"music\"]}}, \"Reader\": {\"top_k\": 5}},\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from haystack.utils import print_answers\n",
    "\n",
    "\n",
    "print_answers(prediction, details=\"high\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Wrapping it up in an indexing pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "from haystack.pipelines import Pipeline\n",
    "from haystack.nodes import TextConverter, PreProcessor, FileTypeClassifier, PDFToTextConverter, DocxToTextConverter\n",
    "\n",
    "\n",
    "file_type_classifier = FileTypeClassifier()\n",
    "text_converter = TextConverter()\n",
    "pdf_converter = PDFToTextConverter()\n",
    "docx_converter = DocxToTextConverter()\n",
    "\n",
    "indexing_pipeline_with_classification = Pipeline()\n",
    "indexing_pipeline_with_classification.add_node(\n",
    "    component=file_type_classifier, name=\"FileTypeClassifier\", inputs=[\"File\"]\n",
    ")\n",
    "indexing_pipeline_with_classification.add_node(\n",
    "    component=text_converter, name=\"TextConverter\", inputs=[\"FileTypeClassifier.output_1\"]\n",
    ")\n",
    "indexing_pipeline_with_classification.add_node(\n",
    "    component=pdf_converter, name=\"PdfConverter\", inputs=[\"FileTypeClassifier.output_2\"]\n",
    ")\n",
    "indexing_pipeline_with_classification.add_node(\n",
    "    component=docx_converter, name=\"DocxConverter\", inputs=[\"FileTypeClassifier.output_4\"]\n",
    ")\n",
    "indexing_pipeline_with_classification.add_node(\n",
    "    component=preprocessor_sliding_window,\n",
    "    name=\"Preprocessor\",\n",
    "    inputs=[\"TextConverter\", \"PdfConverter\", \"DocxConverter\"],\n",
    ")\n",
    "indexing_pipeline_with_classification.add_node(\n",
    "    component=doc_classifier, name=\"DocumentClassifier\", inputs=[\"Preprocessor\"]\n",
    ")\n",
    "indexing_pipeline_with_classification.add_node(\n",
    "    component=document_store, name=\"DocumentStore\", inputs=[\"DocumentClassifier\"]\n",
    ")\n",
    "# Uncomment the following to generate the pipeline image\n",
    "# indexing_pipeline_with_classification.draw(\"index_time_document_classifier.png\")\n",
    "\n",
    "document_store.delete_documents()\n",
    "txt_files = [f for f in Path(doc_dir).iterdir() if f.suffix == \".txt\"]\n",
    "pdf_files = [f for f in Path(doc_dir).iterdir() if f.suffix == \".pdf\"]\n",
    "docx_files = [f for f in Path(doc_dir).iterdir() if f.suffix == \".docx\"]\n",
    "indexing_pipeline_with_classification.run(file_paths=txt_files)\n",
    "indexing_pipeline_with_classification.run(file_paths=pdf_files)\n",
    "indexing_pipeline_with_classification.run(file_paths=docx_files)\n",
    "\n",
    "document_store.get_all_documents()[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "# we can store this pipeline and use it from the REST-API\n",
    "indexing_pipeline_with_classification.save_to_yaml(\"indexing_pipeline_with_classification.yaml\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.8.9 64-bit",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.9"
  },
  "vscode": {
   "interpreter": {
    "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
